|
A canonical Huffman code is a particular type of Huffman code with unique properties which allow it to be described in a very compact manner. Data compressors generally work in one of two ways. Either the decompressor can infer what codebook the compressor has used from previous context, or the compressor must tell the decompressor what the codebook is. Since a canonical Huffman codebook can be stored especially efficiently, most compressors start by generating a "normal" Huffman codebook, and then convert it to canonical Huffman before using it. In order for a symbol code scheme such as the Huffman code to be decompressed, the same model that the encoding algorithm used to compress the source data must be provided to the decoding algorithm so that it can use it to decompress the encoded data. In standard Huffman coding this model takes the form of a tree of variable-length codes, with the most frequent symbols located at the top of the structure and being represented by the fewest number of bits. However, this code tree introduces two critical inefficiencies into an implementation of the coding scheme. Firstly, each node of the tree must store either references to its child nodes or the symbol that it represents. This is expensive in memory usage and if there is a high proportion of unique symbols in the source data then the size of the code tree can account for a significant amount of the overall encoded data. Secondly, traversing the tree is computationally costly, since it requires the algorithm to jump randomly through the structure in memory as each bit in the encoded data is read in. Canonical Huffman codes address these two issues by generating the codes in a clear standardized format; all the codes for a given length are assigned their values sequentially. This means that instead of storing the structure of the code tree for decompression only the lengths of the codes are required, reducing the size of the encoded data. Additionally, because the codes are sequential, the decoding algorithm can be dramatically simplified so that it is computationally efficient. ==Algorithm== The normal Huffman coding algorithm assigns a variable length code to every symbol in the alphabet. More frequently used symbols will be assigned a shorter code. For example, suppose we have the following ''non''-canonical codebook: A = 11 B = 0 C = 101 D = 100 Here the letter A has been assigned 2 bits, B has 1 bit, and C and D both have 3 bits. To make the code a ''canonical'' Huffman code, the codes are renumbered. The bit lengths stay the same with the code book being sorted ''first'' by codeword length and ''secondly'' by alphabetical value: B = 0 A = 11 C = 101 D = 100 Each of the existing codes are replaced with a new one of the same length, using the following algorithm: * The ''first'' symbol in the list gets assigned a codeword which is the same length as the symbol's original codeword but all zeros. This will often be a single zero ('0'). * Each subsequent symbol is assigned the next binary number in sequence, ensuring that following codes are always higher in value. * When you reach a longer codeword, then ''after'' incrementing, append zeros until the length of the new codeword is equal to the length of the old codeword. This can be thought of as a left shift. By following these three rules, the ''canonical'' version of the code book produced will be: B = 0 A = 10 C = 110 D = 111 抄文引用元・出典: フリー百科事典『 ウィキペディア(Wikipedia)』 ■ウィキペディアで「Canonical Huffman code」の詳細全文を読む スポンサード リンク
|